“Missing is Useful”: Missing Values in Cost-sensitive Decision Trees 1
Authors
• Shichao Zhang is with the Department of Computer Science at Guangxi Normal University, Guilin, China, and with the Faculty of Information Technology at University of Technology Sydney, PO Box 123, Broadway, Sydney, NSW 2007, Australia; zhangsc@it.uts.edu.au.
• Zhenxing Qin is with the Faculty of Information Technology at University of Technology Sydney, PO Box 123, Broadway, Sydney, NSW 2007, Australia; zqin@it.uts.edu.au.
• Charles X. Ling and Shengli Sheng are with the Department of Computer Science at The University of Western Ontario, London, Ontario N6A 5B7, Canada; {cling, ssheng}@csd.uwo.ca.
1 This work is partially supported by Australian large ARC grants (DP0343109 and DP0559536), a China NSFC major research program (60496321), and a China NSFC grant (60463003).
Abstract
Many real-world datasets for machine learning and data mining contain missing values. Much previous research regards them as a problem and attempts to impute them before training and testing. In this paper, we study this issue in cost-sensitive learning that considers both test costs and misclassification costs. If the values of some attributes (tests) are too expensive to obtain, it is more cost-effective to leave them missing, much as expensive and risky tests are skipped (missing values) in patient diagnosis (classification). That is, “missing is useful”: missing values actually reduce the total cost of tests and misclassifications, and therefore it is not meaningful to impute them. We discuss and compare several strategies for cost-sensitive decision tree learning that use only known values and exploit the idea that “missing is useful” for cost reduction.
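The trade-off described in the abstract, paying for a test only when it lowers the combined cost of tests and misclassifications, can be sketched in a few lines of Python. This is a minimal illustration under assumed numbers, not the authors' algorithm: the function names, the class counts, and the cost values (false-positive cost 100, false-negative cost 400, per-example test costs of 50 and 10) are all hypothetical.

```python
def leaf_cost(pos, neg, fp_cost, fn_cost):
    """Smallest misclassification cost for a leaf holding `pos` positive
    and `neg` negative examples: label it with the cheaper class."""
    cost_if_labelled_positive = neg * fp_cost  # every negative is a false positive
    cost_if_labelled_negative = pos * fn_cost  # every positive is a false negative
    return min(cost_if_labelled_positive, cost_if_labelled_negative)


def cost_with_test(branches, test_cost, fp_cost, fn_cost):
    """Total cost when the test is performed on every example:
    test costs plus the misclassification cost of each resulting branch.
    `branches` is a list of (pos, neg) counts, one per test outcome."""
    n_examples = sum(pos + neg for pos, neg in branches)
    misclassification = sum(leaf_cost(p, n, fp_cost, fn_cost) for p, n in branches)
    return n_examples * test_cost + misclassification


def should_skip_test(branches, test_cost, fp_cost, fn_cost):
    """True when leaving the attribute value missing is at least as cheap."""
    pos = sum(p for p, _ in branches)
    neg = sum(n for _, n in branches)
    return leaf_cost(pos, neg, fp_cost, fn_cost) <= cost_with_test(
        branches, test_cost, fp_cost, fn_cost
    )


# Hypothetical node: 30 positive and 70 negative examples, which a test
# would split into two branches with (24, 16) and (6, 54) examples.
branches = [(24, 16), (6, 54)]
print(should_skip_test(branches, test_cost=50, fp_cost=100, fn_cost=400))
print(should_skip_test(branches, test_cost=10, fp_cost=100, fn_cost=400))
```

With these assumed figures, classifying without the test costs 7000 in total. The expensive test brings the total to 9000, so it is cheaper to leave the value missing, while the cheap test brings it down to 5000 and is worth performing.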
Similar resources
Ordered Estimation of Missing Values
When attempting to discover by learning concepts embedded in data, it is not uncommon to find that information is missing from the data. Such missing information can diminish the confidence on the concepts learned from the data. This paper describes a new approach to fill missing values in examples provided to a learning algorithm. A decision tree is constructed to determine the missing values of ea...
Cost Efficiency Measures In Data Envelopment Analysis With Nonhomogeneous DMUs
In conventional data envelopment analysis (DEA), it is assumed that all decision making units (DMUs) use the same input and output measures, meaning that the DMUs are homogeneous. In some settings, however, this usual assumption of DEA might be violated. A related problem is that of missing data, where a DMU produces a certain output or consumes a certain input but the val...
Data Quality Improvement by Imputation of Missing Values
Having missing values in a data set is very common due to various reasons including human error, misunderstanding and equipment malfunctioning. Therefore, imputation of missing values is important to improve the quality of a data set. In our previous study we presented an imputation technique called DMI, which we then found better than an existing technique called EMI in terms of a few commonly...
Investigating the missing data effect on credit scoring rule based models: The case of an Iranian bank
Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Data sets of previous loan applicants are built by gathering their data, and these internal data sets are usually completed using external credit bureau data and finally used for estimating PD in banks. There is also a continuous interest for banks to use rule based classifiers to b...
Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees
Classification performance can degrade if data contain missing attribute values. Many methods deal with missing information in a simple way, such as replacing missing values with the global or class-conditional mean/mode. We propose a new iterative algorithm to effectively estimate missing attribute values in both training data and test data. The attributes are selected one by one to be complet...